Exploratory Data Analysis: Boston Housing Dataset

Author

BST 236

Introduction

The Boston Housing dataset contains 506 observations and 14 variables describing housing in the Boston area. The target variable is MEDV (median value of owner-occupied homes in $1000s). The 13 predictor variables are:

Variable  Description
CRIM      Per capita crime rate by town
ZN        Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     Proportion of non-retail business acres per town
CHAS      Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX       Nitric oxides concentration (parts per 10 million)
RM        Average number of rooms per dwelling
AGE       Proportion of owner-occupied units built prior to 1940
DIS       Weighted distances to five Boston employment centres
RAD       Index of accessibility to radial highways
TAX       Full-value property-tax rate per $10,000
PTRATIO   Pupil-teacher ratio by town
B         1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT     Percentage lower status of the population

Setup

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yaml

# Load config
with open('config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Visualization settings
sns.set_style(config['visualization']['style'])
plt.rcParams['font.size'] = config['visualization']['font_size']

# Load data
df = pd.read_csv('../' + config['data']['processed_data'])
print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
Dataset loaded: 506 rows, 14 columns
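
The loading code above reads paths and plot settings from config/config.yaml. A minimal sketch of the shape that file would need to have, given the keys accessed in the snippet (the values here are illustrative assumptions, not the project's actual settings):

```yaml
# Hypothetical config/config.yaml matching the keys read in Setup
visualization:
  style: whitegrid        # passed to sns.set_style
  font_size: 11           # applied to plt.rcParams['font.size']

data:
  processed_data: data/processed/boston.csv  # joined with '../' at load time
```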

Basic Data Overview

Shape and Data Types

Code
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
Shape: (506, 14)

Data types:
CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
MEDV       float64
dtype: object

First Few Rows

Code
df.head(10)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21 28.7
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395.60 12.43 22.9
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396.90 19.15 27.1
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386.63 29.93 16.5
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386.71 17.10 18.9

Summary Statistics

Code
df.describe().round(2)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
count 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00
mean 3.61 11.36 11.14 0.07 0.55 6.28 68.57 3.80 9.55 408.24 18.46 356.67 12.65 22.53
std 8.60 23.32 6.86 0.25 0.12 0.70 28.15 2.11 8.71 168.54 2.16 91.29 7.14 9.20
min 0.01 0.00 0.46 0.00 0.38 3.56 2.90 1.13 1.00 187.00 12.60 0.32 1.73 5.00
25% 0.08 0.00 5.19 0.00 0.45 5.89 45.02 2.10 4.00 279.00 17.40 375.38 6.95 17.02
50% 0.26 0.00 9.69 0.00 0.54 6.21 77.50 3.21 5.00 330.00 19.05 391.44 11.36 21.20
75% 3.68 12.50 18.10 0.00 0.62 6.62 94.07 5.19 24.00 666.00 20.20 396.22 16.96 25.00
max 88.98 100.00 27.74 1.00 0.87 8.78 100.00 12.13 24.00 711.00 22.00 396.90 37.97 50.00

Missing Values

Code
missing = df.isnull().sum()
print(f"Total missing values: {missing.sum()}")
if missing.sum() > 0:
    print(missing[missing > 0])
else:
    print("No missing values in any column.")
Total missing values: 0
No missing values in any column.

Univariate Analysis

Histograms with KDE

Code
fig, axes = plt.subplots(4, 4, figsize=(14, 12))
axes = axes.flatten()

for i, col in enumerate(df.columns):
    sns.histplot(df[col], kde=True, ax=axes[i], color='steelblue', edgecolor='white')
    axes[i].set_title(col, fontsize=11, fontweight='bold')
    axes[i].set_xlabel('')

# Hide unused subplots
for j in range(len(df.columns), len(axes)):
    axes[j].set_visible(False)

plt.suptitle('Distribution of All Variables', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Box Plots

Code
fig, axes = plt.subplots(4, 4, figsize=(14, 12))
axes = axes.flatten()

for i, col in enumerate(df.columns):
    sns.boxplot(y=df[col], ax=axes[i], color='steelblue')
    axes[i].set_title(col, fontsize=11, fontweight='bold')

for j in range(len(df.columns), len(axes)):
    axes[j].set_visible(False)

plt.suptitle('Box Plots for Outlier Detection', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Bivariate Analysis

Correlation Heatmap

Code
corr = df.corr()

# Mask the upper triangle so the heatmap shows only the lower triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(
    corr, mask=mask, annot=True, fmt='.2f',
    cmap='RdBu_r', center=0, vmin=-1, vmax=1,
    square=True, linewidths=0.5, ax=ax,
    cbar_kws={'shrink': 0.8}
)
ax.set_title('Correlation Matrix (Lower Triangle)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Scatter Plots: Each Feature vs MEDV

Code
features = [c for c in df.columns if c != 'MEDV']

fig, axes = plt.subplots(4, 4, figsize=(16, 16))
axes = axes.flatten()

for i, feat in enumerate(features):
    sns.regplot(x=df[feat], y=df['MEDV'], ax=axes[i],
                scatter_kws={'alpha': 0.4, 's': 15, 'color': 'steelblue'},
                line_kws={'color': 'red', 'linewidth': 1.5})
    r = df[feat].corr(df['MEDV'])
    axes[i].set_title(f'{feat} (r={r:.2f})', fontsize=11, fontweight='bold')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('MEDV' if i % 4 == 0 else '')

for j in range(len(features), len(axes)):
    axes[j].set_visible(False)

plt.suptitle('Feature vs MEDV with Regression Lines', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Multivariate Analysis

Pairplot of Top 5 Correlated Features with MEDV

Code
corr_with_medv = corr['MEDV'].drop('MEDV').abs().sort_values(ascending=False)
top5 = corr_with_medv.head(5).index.tolist()

print(f"Top 5 features most correlated with MEDV: {top5}")

pairplot_cols = top5 + ['MEDV']
g = sns.pairplot(df[pairplot_cols], diag_kind='kde',
                 plot_kws={'alpha': 0.4, 's': 15, 'color': 'steelblue'},
                 diag_kws={'color': 'steelblue'})
g.figure.suptitle('Pairplot: Top 5 Features + MEDV', fontsize=14, fontweight='bold', y=1.01)
plt.show()
Top 5 features most correlated with MEDV: ['LSTAT', 'RM', 'PTRATIO', 'INDUS', 'TAX']

Key Insights

Highly Correlated Feature Pairs (|r| > 0.7)

Code
# Find pairs with |r| > 0.7 (excluding self-correlations)
threshold = 0.7
high_corr = []
for i in range(len(corr.columns)):
    for j in range(i + 1, len(corr.columns)):
        if abs(corr.iloc[i, j]) > threshold:
            high_corr.append({
                'Feature 1': corr.columns[i],
                'Feature 2': corr.columns[j],
                'Correlation': round(corr.iloc[i, j], 3)
            })

high_corr_df = pd.DataFrame(high_corr).sort_values('Correlation', key=abs, ascending=False)
print(f"Feature pairs with |r| > {threshold}:\n")
if len(high_corr_df) > 0:
    print(high_corr_df.to_string(index=False))
else:
    print("None found.")
Feature pairs with |r| > 0.7:

Feature 1 Feature 2  Correlation
      RAD       TAX        0.910
      NOX       DIS       -0.769
    INDUS       NOX        0.764
      AGE       DIS       -0.748
    LSTAT      MEDV       -0.738
      NOX       AGE        0.731
    INDUS       TAX        0.721
    INDUS       DIS       -0.708

Feature Correlations with MEDV (Ranked by |r|)

Code
medv_corr = corr['MEDV'].drop('MEDV').sort_values(key=abs, ascending=False)
medv_corr_df = pd.DataFrame({
    'Feature': medv_corr.index,
    'Correlation with MEDV': medv_corr.values.round(3),
    '|r|': abs(medv_corr.values).round(3)
})
print(medv_corr_df.to_string(index=False))
Feature  Correlation with MEDV   |r|
  LSTAT                 -0.738 0.738
     RM                  0.695 0.695
PTRATIO                 -0.508 0.508
  INDUS                 -0.484 0.484
    TAX                 -0.469 0.469
    NOX                 -0.427 0.427
   CRIM                 -0.388 0.388
    RAD                 -0.382 0.382
    AGE                 -0.377 0.377
     ZN                  0.360 0.360
      B                  0.333 0.333
    DIS                  0.250 0.250
   CHAS                  0.175 0.175

Skewed Distributions (|Skewness| > 1)

Code
skewness = df.skew().sort_values(key=abs, ascending=False)
skewed = skewness[abs(skewness) > 1]
skew_df = pd.DataFrame({
    'Variable': skewed.index,
    'Skewness': skewed.values.round(3)
})
print("Variables with |skewness| > 1:\n")
if len(skew_df) > 0:
    print(skew_df.to_string(index=False))
else:
    print("None found.")
Variables with |skewness| > 1:

Variable  Skewness
    CRIM     5.223
    CHAS     3.406
       B    -2.890
      ZN     2.226
    MEDV     1.108
     DIS     1.012
     RAD     1.005

Outlier Counts (IQR Method)

Code
outlier_counts = {}
for col in df.columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    outlier_counts[col] = n_outliers

outlier_df = pd.DataFrame({
    'Variable': list(outlier_counts.keys()),
    'Outlier Count': list(outlier_counts.values())
}).sort_values('Outlier Count', ascending=False)

print("Outlier counts per variable (IQR method):\n")
print(outlier_df.to_string(index=False))
Outlier counts per variable (IQR method):

Variable  Outlier Count
       B             77
      ZN             68
    CRIM             66
    MEDV             40
    CHAS             35
      RM             30
 PTRATIO             15
   LSTAT              7
     DIS              5
   INDUS              0
     NOX              0
     AGE              0
     RAD              0
     TAX              0

Summary

This EDA reveals several important characteristics of the Boston Housing dataset:

  • No missing values — the dataset is complete and ready for modeling.
  • LSTAT and RM are the two features most strongly correlated with MEDV (r = -0.74 and 0.70, respectively), making them likely key predictors.
  • Several feature pairs exhibit strong collinearity (e.g., RAD–TAX at r = 0.91, NOX–DIS at r = -0.77), which can destabilize coefficient estimates in linear models.
  • Multiple variables show pronounced skewness (e.g., CRIM, ZN, B), so log or power transformations may be worthwhile; note that CHAS is a binary dummy, so its skewness and flagged "outliers" are artifacts of the 0/1 coding rather than genuine anomalies.
  • The IQR method flags outliers in most variables, with B, ZN, and CRIM showing the highest counts.